ABSTRACT SAR11 is an ancient and diverse clade of heterotrophic bacteria that are abundant throughout the world's oceans,
where they play a major role in the ocean carbon cycle. Correlations between the phylogenetic branching order and
spatiotemporal patterns in cell distributions from planktonic ocean environments indicate that SAR11 has evolved into perhaps
a dozen or more specialized ecotypes that span evolutionary distances equivalent to a bacterial order. We isolated and
sequenced genomes from diverse SAR11 cultures that represent three major lineages and encompass the full breadth of the clade.
The new data expand observations about genome evolution and gene content that previously had been restricted to the SAR11 Ia
subclade, providing a much broader perspective on the clade's origins, evolution, and ecology. We found small genomes
throughout the clade and a very high proportion of core genome genes (48 to 56%), indicating that small genome size is
probably an ancestral characteristic. In their level of core genome conservation, the members of SAR11 are outliers, the most
conserved free-living bacteria known. Shared features of the clade include low GC content, high gene synteny, a large
hypervariable region bounded by rRNA genes, and low numbers of paralogs. Variation among the genomes included genes for
phosphorus metabolism, glycolysis, and C1 metabolism, suggesting that adaptive specialization in nutrient resource utilization
is important to niche partitioning and ecotype divergence within the clade. These data provide support for the conclusion that
streamlining selection for efficient cell replication in the planktonic habitat has occurred throughout the evolution and
diversification of this clade.

IMPORTANCE The SAR11 clade is the most abundant group of marine microorganisms worldwide, making them key players in
the global carbon cycle. Growing knowledge about their biochemistry and metabolism is leading to a more mechanistic
understanding of organic carbon oxidation and sequestration in the oceans. The discovery of small genomes in SAR11 provided
crucial support for the theory that streamlining selection can drive genome reduction in low-nutrient environments. Study of
isolates in culture revealed atypical organic nutrient requirements that can be attributed to genome reduction, such as
conditional auxotrophy for glycine and its precursors, a requirement for reduced sulfur compounds, and evidence for widespread
cycling of C1 compounds in marine environments. However, understanding the genetic variation and distribution of such pathways
and characteristics like streamlining throughout the group has required the isolation and genome sequencing of diverse SAR11
representatives, an analysis of which we provide here.
Alphaproteobacteria of the SAR11 clade are the most abundant group of planktonic cells in marine systems, typically
accounting for ~25% of prokaryotic cells in seawater worldwide (1, 2). In addition to their importance to marine
biogeochemical cycles, these highly successful organisms, along with genomes from Prochlorococcus (3), provided the first
compelling evidence for the theory that streamlining selection has shaped the evolution of some major lineages of marine
bacterioplankton. Cultivation of the temperate coastal SAR11 isolate "Candidatus Pelagibacter ubique" strain HTCC1062 and the
subsequent sequencing of its genome revealed that it possesses many features unusual for a free-living organism, including an
extremely small, streamlined genome with few paralogs, no pseudogenes, and many missing genes and pathways that are otherwise
common in bacteria (4, 5). However, the SAR11 clade is phylogenetically diverse, spanning 18% 16S rRNA gene divergence (6) and
encompassing at least a dozen ecotypes that are identified by their unique distributions in the environment (7-11; K. L.
Vergin et al., submitted for publication). Wilhelm et al. (12) drew the conclusion that SAR11 genomes are highly conserved in
gene content and synteny by comparing SAR11 genome sequences with fragmentary SAR11 sequence data extracted from Global Ocean
Survey (GOS) metagenomes and measuring conservation of synteny and variation in gene-gene boundaries. They found that 96% of
homologous fragments were conserved in gene order relative to the HTCC1062 genome. They also reported greater genomic
rearrangement at operon boundaries than within operons, as well as hypervariable regions (HVRs), also termed genomic islands
in Prochlorococcus (13), that appeared to have conserved locations within the genome, possibly allowing these cells to acquire
novel genetic material with adaptive significance (12, 14).

Comparative genomics with more strains offers a means to understand the evolutionary history of SAR11, to confirm
predictions, such as those of Wilhelm et al. (12), and to understand the functional significance of SAR11 ecotype diversity in
the oceans today. For example, it has been uncertain whether proteorhodopsin (PR) (15, 16), C1 and methyl group oxidation
(17), or the requirements for reduced sulfur (18) and glycine/serine (19) are found throughout the clade. Advancements in
high-throughput culturing techniques (20, 21) and knowledge obtained from our previous work have recently resulted in the
successful culturing of representatives of SAR11 that span three divergent phylogenetic lineages of the proposed family
"Pelagibacteraceae" (6) (Fig. 1). Five SAR11 strains (HTCC1062, HTCC1002, HTCC9565, HTCC7211, and HIMB5) form a group of
closely related lineages (16S identity > 98%; ANI [average nucleotide identity] > 75%) within SAR11 subclade Ia, which is
ubiquitous in geographic distribution (1, 2). Strain HIMB114 is more distantly related (88% 16S identity with HTCC1062) and is
part of subclade IIIa, which is a sister group to the freshwater SAR11 subclade IIIb/LD12 lineage (Vergin et al., submitted).
The subclade Va strain HIMB59 is very distantly related (82% 16S identity with HTCC1062) but has been classified as a SAR11
strain based on monophyletic grouping with the other SAR11 strains using both 16S (Vergin et al., submitted) (Fig. 1) and
concatenated protein phylogenies (6). Here we present a detailed comparative analysis of these seven SAR11 genomes that
provides new insight into the genome features and genetic content of this diverse group of globally abundant organisms.

FIG 1 16S phylogenetic tree of the SAR11 clade (blue), showing a subset of major subclades defined here
and elsewhere (6, 7) and the genomes included in this study (red). Bootstrap support is displayed at the
nodes. Scale bar indicates 0.06 changes per position.



RESULTS 

General genome features. The strains in this study were isolated from surface seawater of disparate origin: HTCC1062 and
HTCC1002 from the temperate coastal Northeast Pacific (4), HTCC9565 from the temperate open ocean of the Northeast Pacific,
HTCC7211 from the Sargasso Sea in the subtropical Atlantic (21), and HIMB5, HIMB114, and HIMB59 from the coastal tropical
North Pacific (Table 1; see also Table S8 at http://giovannonilab.science.oregonstate.edu/publications). The genomes of
HTCC1062, HTCC1002, HTCC7211, HIMB5, and HIMB59 are closed, while the genomes of HIMB114 and HTCC9565 consist of scaffolds
with one and three contigs, respectively. Based on synteny with the other genomes of subclade Ia, the amount of missing
information for the HTCC9565 genome is estimated to be from <1 to ca. 5.5 kbp. While the degree of completion of the HIMB114
genome is more difficult to estimate, a second recently sequenced subclade IIIa genome is complete at 1.285 Mbp (22), which is
less than 50 kbp larger than the current HIMB114 sequence. The presence of a compact (mean genome size of 1.337 ± 0.08 Mbp),
low G+C (28.6 to 32.3%) genome is a unifying characteristic of the SAR11 clade (Table 1). The genomes code for between 1,357
and 1,576 genes, one copy each of the 5S, 16S, and 23S ribosomal RNA genes, and 30 to 35 tRNAs (see Table S1 at the above
URL). No pseudogenes were identified in any of the strains.
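
The summary statistic quoted above can be checked directly from the per-strain genome sizes listed in Table 1. The short sketch below does this; it assumes that the "±" term is a sample standard deviation (n − 1 denominator), which the text does not state explicitly.

```python
from statistics import mean, stdev

# Genome sizes in Mbp, one value per strain, as listed in Table 1.
genome_sizes_mbp = {
    "HTCC1062": 1.309,
    "HTCC1002": 1.323,
    "HTCC9565": 1.280,
    "HTCC7211": 1.457,
    "HIMB5": 1.343,
    "HIMB114": 1.237,
    "HIMB59": 1.410,
}

sizes = list(genome_sizes_mbp.values())
mean_size = mean(sizes)
# Assumption: the "±" value is the sample standard deviation.
sd_size = stdev(sizes)

print(f"mean genome size = {mean_size:.3f} Mbp, SD = {sd_size:.2f} Mbp")
# prints: mean genome size = 1.337 Mbp, SD = 0.08 Mbp
```

Under that assumption the seven values reproduce the reported 1.337 ± 0.08 Mbp.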

TABLE 1 Characteristics of SAR11 genomes used in this study

                              Value or description for strain
Characteristic                HTCC1062               HTCC1002               HTCC9565               HTCC7211               HIMB5                  HIMB114                HIMB59
Subclade                      Ia                     Ia                     Ia                     Ia                     Ia                     IIIa                   Va
Environment                   Coastal, temperate     Coastal, temperate     Open ocean, temperate  Open ocean, subtropic  Coastal, tropic        Coastal, tropic        Coastal, tropic
Origin                        NE Pacific             NE Pacific             NE Pacific             Sargasso Sea, Atlantic N Pacific              N Pacific              N Pacific
Genome size (Mbp)             1.309                  1.323                  1.280                  1.457                  1.343                  1.237                  1.410
Status                        Closed                 Closed                 3 contigs              Closed                 Closed                 1 contig               Closed
GC content (%)                29.7                   29.8                   28.9                   29.0                   28.6                   29.6                   32.3
Total no. of genes            1,394                  1,423                  1,386                  1,576                  1,467                  1,357                  1,532
No. of protein-coding genes   1,354                  1,387                  1,352                  1,541                  1,431                  1,321                  1,493
% SAR11 core (a)              54.0                   52.6                   54.0                   48.5                   51.0                   56.4                   51.8
% SAR11 unique (b)            1.4                    3.6                    7.1                    10.7                   7.8                    13.1                   26.2
% subclade Ia core (a)        80.8                   78.4                   80.7                   72.9                   75.9
% subclade Ia unique (b)      1.4                    3.9                    7.6                    12.8                   10.9

(a) Percentage of total genes within the SAR11 or SAR11 subclade Ia core.
(b) Percentage of total genes unique to individual strains compared with all seven SAR11 genomes (SAR11 unique) or five SAR11 subclade Ia genomes (subclade Ia unique).

The core and pan-genome of the Pelagibacteraceae. We investigated the SAR11 pan-genome, the total set of genes found
across the seven genomes, by examining orthologous clusters (OCs) and excluding paralogs and non-protein-coding genes. The
Pelagibacteraceae pan-genome contains a total of 2,558 predicted OCs, with a conserved core genome of 705 OCs present in all
SAR11 strains (Fig. 2A). The "flexible" genome (genes found in one or more but not all genomes) contains 1,853 OCs, 997 unique
and 856 shared non-core (Fig. 2B). The contribution of core, unique, and shared non-core OCs to the SAR11 pan-genome changes
considerably at different levels of phylogenetic similarity, with the number of orthologs in the core genome (blue boxes)
negatively correlated with evolutionary distance (Fig. 2B). When considering only the five SAR11 subclade Ia genomes, the
pan-genome of 1,962 OCs consists of an even more conserved core genome of 1,060 OCs (Fig. 2B and C). The actual numbers of
genes in the core and flexible genomes differ by strain due to paralogs, detailed below. The predicted size of the core SAR11
and subclade Ia genomes was extrapolated by fitting an exponential decay function to the average number of core OCs calculated
for the sequential addition of the seven genome sequences (Fig. 3A), resulting in a predicted SAR11 core genome of 598 OCs.
When the five SAR11 subclade Ia genomes are considered separately, the predicted core genome of 1,047 OCs closely matches the
observed core genome size of 1,060, suggesting that the current subclade Ia core genome is well defined by the available
genomes.
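
The extrapolation described above can be illustrated with a minimal sketch. The functional form is not given in the text, so the code assumes one commonly used form, F(N) = kappa * exp(-N / tau) + omega, in which omega is the asymptotic (predicted) core genome size. The function name, the grid of tau values, and the input values are hypothetical; the data are synthetic and are not the values from this study.

```python
import numpy as np

def fit_exponential_decay(n, y, taus=None):
    """Fit y ~ kappa * exp(-n / tau) + omega by scanning candidate tau values.

    For a fixed tau the model is linear in (kappa, omega), so each candidate
    is solved by ordinary least squares and the smallest residual is kept.
    """
    n = np.asarray(n, dtype=float)
    y = np.asarray(y, dtype=float)
    if taus is None:
        taus = np.linspace(0.5, 20.0, 391)  # assumed search grid
    best = None
    for tau in taus:
        design = np.column_stack([np.exp(-n / tau), np.ones_like(n)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        sse = float(np.sum((design @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(coef[0]), float(tau), float(coef[1]))
    _, kappa, tau, omega = best
    return kappa, tau, omega

# Synthetic illustration only: average core size after adding 1..7 genomes.
n_genomes = np.arange(1, 8)
avg_core = 900.0 * np.exp(-n_genomes / 2.0) + 600.0

kappa, tau, omega = fit_exponential_decay(n_genomes, avg_core)
print(f"predicted core genome size (asymptote): {omega:.0f} OCs")
```

Scanning tau keeps the sketch dependency-light; a general nonlinear least-squares routine would serve the same purpose.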

To model the global SAR11 pan-genome, we applied the method described previously by Tettelin et al. (23), which predicts
the number of new orthologs expected to be discovered with each additional sequenced genome, as well as the degree to which
the SAR11 pan-genome is open, meaning how many unique genes will be identified with each newly sequenced strain. The number of
new orthologs decreased as more strains were compared, resulting in an average value of 142 new orthologs per genome when all
seven sequenced genomes are considered (Fig. 3B). When the five SAR11 subclade Ia genomes were analyzed separately, the number
of new orthologs added by the 5th strain was 105 on average. Power law regression analyses of the average number of new SAR11
orthologs and average total pan-genome size resulted in values of α = 0.70 and β = 0.34 for the exponents (Fig. 3B and C).
These values agree reasonably with the relation α = 1 − β as required by Heaps' law applied to the pan-genome model (24), and
α < 1 indicates an open pan-genome for SAR11. While the SAR11 subclade Ia pan-genome is also open (α = 0.75 and β = 0.24), its
smaller size and lower rate of growth reflect that this group is better defined by the current genomes than the entire clade.
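
A power law regression of this kind can be sketched as a straight-line fit in log-log space. In the example below the two exponents (written as alpha for new orthologs and beta for pan-genome size) are recovered from synthetic curves; the prefactors, the range of genome counts, and the function name are arbitrary assumptions for illustration and are not data from this study.

```python
import numpy as np

def fit_power_law(n, y):
    """Least-squares fit of y = c * n**exponent, done in log-log space."""
    slope, intercept = np.polyfit(np.log(n), np.log(y), 1)
    return float(np.exp(intercept)), float(slope)

# Synthetic illustration only (arbitrary prefactors, noiseless curves).
n = np.arange(2, 8, dtype=float)     # number of genomes compared
new_orthologs = 500.0 * n ** -0.70   # new genes per added genome ~ n**(-alpha)
pan_size = 1000.0 * n ** 0.34        # pan-genome size ~ n**beta

_, neg_alpha = fit_power_law(n, new_orthologs)
_, beta = fit_power_law(n, pan_size)
alpha = -neg_alpha

# Heaps' law expects alpha to be close to 1 - beta; alpha < 1 means "open".
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}, 1 - beta = {1 - beta:.2f}")
print("open pan-genome" if alpha < 1 else "closed pan-genome")
```

With real averages the two fitted exponents would not satisfy alpha = 1 − beta exactly, which is why the text describes the agreement as reasonable rather than exact.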

Comparison of total conserved gene content to that of other bacterial groups. To put the relative conservation of the
SAR11 core genome in perspective, we compared our results to those from other comparative genome studies, including studies of
environmentally relevant prokaryotes and those with similar genome sizes (Fig. 4; see also Table S2 at
http://giovannonilab.science.oregonstate.edu/publications). Here we considered all genes, including paralogs, since ignoring
duplications would artificially inflate the amount of conservation, and calculated the number of core genes as a percentage of
total genes for each SAR11 strain (Table 1). Pairwise average amino acid identity (AAI) for the SAR11 clade follows the
general trend for bacteria (Fig. 5A; see also Table S3 at the above URL) (25, 26) and, based on 16S rRNA gene comparisons
(18%) and AAI comparisons, spans order-level divergence. In spite of this, the SAR11 core genome represents 48 to 56% of the
total gene repertoire per strain and is similar in proportion to that of bacterial genera like Shewanella (7% 16S rRNA gene
divergence) (27) (Fig. 4). The core genomes for groups with similar divergence at the 16S rRNA gene (Cyanobacteria [28],
Halobacteriaceae [29], Thermotogales [30], and Anaplasmataceae [31]) show lower average conservation than the SAR11 core
genome. The most comparable values are those for the Anaplasmataceae, composed of obligate intracellular symbionts with an
even smaller average genome size than SAR11, and the thermophilic/hyperthermophilic Thermotogales group. However, these two
groups are less divergent than SAR11 in the 16S rRNA gene (16%).

The core genomes of free-living microorganisms with a degree of 16S rRNA gene divergence similar to that of SAR11
subclade Ia (2%), such as Prochlorococcus (3%) (32) or Rhodopseudomonas (3%) (33), are considerably less conserved (Fig. 4).
The core genome of 10 obligately intracellular Rickettsia strains, which are phylogenetically closely related to the SAR11
lineage and possess similarly small genome sizes (~1.2 Mbp), is also much less conserved than the SAR11 subclade Ia core
genome (34). The only groups with higher average core genome conservation were either less divergent at the 16S rRNA gene,
obligate intracellular organisms, or both (23, 35, 36). Although the AAI values for SAR11 subclade Ia are appropriate for a
genus (25) (see Table S3 at http://giovannonilab.science.oregonstate.edu/publications), core genome conservation within
subclade Ia is more similar to that of single bacterial species (e.g., Sulfolobus islandicus and Streptococcus agalactiae
[23, 37]).
FIG 2 (A) Venn diagram showing the number of OCs shared between the SAR11 subclade Ia core genome, HIMB114, and HIMB59.
(B) The relative contribution of core (blue), shared non-core (orange), and unique (red) orthologs to the pan-genome at each
level of divergence. The total size of each bar is proportional to the total number of orthologs in the pan-genome. The scale
bar indicates 0.2 changes per position. The tree was redrawn based on the work of Thrash et al. (6). (C) Venn diagram showing
the number of shared OCs among the five strains of SAR11 subclade Ia.
FIG 3 SAR11 pan-genome analysis. The number of core genes (A), new orthologs (B), or total genes (pan-genome) (C) is
plotted versus the sequential addition of genomes (7!/[N!(7 − N)!]). Squares show average values for all members of SAR11
(red) and SAR11 subclade Ia (blue). In panel A, the curve represents the least-squares fit of the average values to an
exponential decay function, and the dotted line indicates the asymptotic values predicted for the SAR11 and SAR11 subclade Ia
core genome size. Curves in panels B and C are from power law regression analyses.
with the megablast using default settings. The color code indicates average genome sizes. The dotted curve represents
approximate average values taken from Fig. 1a in reference 25. The number of genomes compared per study and the average number
of core genes can be found at http://giovannonilab.science.oregonstate.edu/publications. Anaplas., Anaplasmataceae (31);
Chlamy., Chlamydiaceae; Chlamy.1, Chlamydo-
mydia trachomatis (35); Cyanob., cyanobacteria (28); E. rum., Ehrlichia ruminantium (36); Halob., Halobacteriaceae (29);
Mycopl., Mycoplasma (89); Nitrob., Nitrobacter (90); Prochl., Prochlorococcus (32); Rhodops., Rhodopseudomonas (33); Rickett.,
Rickettsia (34); Roseob., Roseobacter clade (47); Shew., Shewanella (27); S. agal., Streptococcus agalactiae (23);
S. islandicus, Sulfolobus islandicus (37); Thermot., Thermotogales (30).



Synteny. The conservation of gene order (synteny) within genomes can be a strong indicator of conserved gene function and
relatedness. Previous studies have demonstrated that synteny decreases with phylogenetic distance, although this relationship
varies depending on the group examined (38-40). A comparison of gene order conservation versus genome sequence similarity
(average bit score of protein-coding orthologs) demonstrated that the SAR11 strains are on the extreme maximum edge of the
range described by Yelton et al. (38), indicating much higher gene order conservation than most other organisms (Fig. 5B),
consistent with predictions (12).
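
As a rough illustration of how gene order conservation between two genomes can be quantified, the sketch below computes the fraction of shared genes that keep at least one of their immediate shared neighbours. This adjacency criterion is a simplifying assumption made here; the exact definition used in the cited studies may differ. The function name and the toy genomes are hypothetical.

```python
def gene_order_conservation(genome_a, genome_b):
    """Fraction of shared genes that keep at least one shared neighbour.

    Genomes are ordered lists of ortholog-cluster IDs on a circular
    chromosome; each ID is assumed to occur at most once per genome.
    """
    shared = set(genome_a) & set(genome_b)
    if len(shared) < 2:
        return 0.0

    def neighbours(genome):
        # Keep only shared genes, then record both circular neighbours.
        order = [g for g in genome if g in shared]
        size = len(order)
        return {
            g: {order[(i - 1) % size], order[(i + 1) % size]}
            for i, g in enumerate(order)
        }

    nb_a, nb_b = neighbours(genome_a), neighbours(genome_b)
    syntenic = sum(1 for g in shared if nb_a[g] & nb_b[g])
    return syntenic / len(shared)

# Hypothetical toy genomes: g1-g4 keep their order, g5-g8 are scrambled,
# and each genome carries one gene that the other lacks.
genome_a = ["g1", "g2", "a_only", "g3", "g4", "g5", "g6", "g7", "g8"]
genome_b = ["g1", "g2", "g3", "g4", "g6", "g8", "g5", "g7", "b_only"]

print(gene_order_conservation(genome_a, genome_b))  # prints 0.5
```

Strain-specific genes are dropped before neighbours are compared, so insertions alone do not lower the score; only rearrangements of shared genes do.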

Genome organization. To visualize global genome organization of core, additional subclade Ia core, shared non-core, and
unique genes, we ordered the seven SAR11 genomes by colocalizing them at dnaA, adjacent to the origin in HTCC1062 (5), moving
clockwise toward dnaN (Fig. 6). Consistent with our calculations showing high conservation of synteny, core (and additional
subclade Ia core) genes are grouped in blocks throughout each genome (blue and green areas of Fig. 6). Shared non-core and
unique genes are scattered throughout the genomes, though some areas of dense groupings are evident (Fig. 6; see also Fig. S1
in the supplemental material).

Previous work revealed HVRs (islands of low genomic recruitment of metagenomic data sets) in SAR11 (12, 14, 41). HVR2
from the work of Wilhelm et al. (12) is conserved in all seven SAR11 genomes, bounded by the 16S rRNA, tRNA(Ile-GAT),
tRNA(Ala-TGC), 23S rRNA cassette on one side and 5S rRNA on the other in all genomes except HIMB59, which has HVR2 bounded by
tRNA(Ser-GGA) and tRNA^ 3 " 000 genes. In HIMB59, the rRNA genes are in the same order but include the 5S rRNA as part of the
operon (16S rRNA, tRNA(Ile-GAT), tRNA(Ala-TGC), 23S rRNA, and 5S rRNA) and are on the other side of the genome from HVR2.
Nevertheless, in all strains, the HVR2 region remains similar in both size and location relative to the dnaAN locus (Fig. 6;
see also Table S4 at http://giovannonilab.science.oregonstate.edu/publications) and comprises ~50 protein-coding genes except
in HTCC7211 and HIMB59, where it comprises 83 and 74 genes, respectively. Consistent with initial observations (12), genes
commonly found in HVR2 include glycosyltransferases, unknown membrane proteins, hypothetical proteins, and methyltransferases
(see Table S1 at the above URL). Probably because HTCC1062 and HTCC1002 are the most closely related strains (AAI, ~96%; ANI,
98%) (Fig. 5; see also Tables S3 and S5 at the above URL), they share all genes in HVR2. However, the remaining isolates
contain large numbers of unique genes in this region (Fig. 6; see also Fig. S1 in the supplemental material), including some
that appear to confer strain-specific metabolic abilities, such as sulfur metabolism genes unique to HTCC9565 and sugar
transporters and phosphofructokinase in HIMB59 that may be indicative of a unique niche for this strain (discussed further
below). HVR2 also contains a concentration of paralogs for most strains (see Fig. S2). In contrast to the conserved location
of HVR2, HVR1, -3, and -4, identified by Wilhelm et al. (12), are not conserved in other SAR11 strains outside of HTCC1062 and
HTCC1002, although each of the other subclade Ia genomes (HTCC9565, HTCC7211, and HIMB5) possesses distinct HVR-like regions
where unique genes are clustered (Fig. 6; see also Fig. S1).

order conservation versus average normalized bit score of protein-coding
defined as the fraction of genes shared by any two organisms that are syntenic (39); "n" is the same as in panel A.

FIG 6 Circular representation of SAR11 genomes. The genomes are arranged in order from the outermost to the innermost as
follows: HTCC1062, HTCC1002, HTCC9565, HTCC7211, HIMB5, HIMB114, and HIMB59. Organisms are aligned with 0 at dnaA, sequences
going clockwise to dnaN and continuing in the order in which they are presented at IMG. Blue, core SAR11 genes; bright green,
additional SAR11 subclade Ia core genes; orange, shared non-core genes; red, unique genes; black, rRNA genes. The outer scale
is measured in units of 10-kbp increments. HVR2 is highlighted in black. Gaps in complete genomes were necessary to display
the genomes in this manner due to the disparity of genome sizes.

Paralogs. One conspicuous feature previously identified within the streamlined genome of HTCC1062 was a low incidence of
paralogs (5). Consistent with this finding and the general trend of decreasing numbers of paralogs with smaller genome size in
Bacteria (42), the SAR11 genomes range from 4.3 to 15.2% of protein-coding genes as paralogs, averaging 7.8% (see Table S6 at
http://giovannonilab.science.oregonstate.edu/publications). The proportion of strain-specific paralogs ranges from 18 to 61%
of the total paralogous genes per genome, with similar distributions across genes in different categories of the pan-genome
(Fig. 7A). Inparalogs and outparalogs are defined as duplications after or before a given speciation event, respectively
(43). In this study, we characterized in- versus outparalogs with phylogenetic trees to determine the number of gene
duplications that occurred relative to the divergence of the SAR11 clade. Of >80% of paralogs with a reliable phylogenetic
assignment, <19% are classified as outparalogs (see Table S1 at the above URL). Thus, the majority of gene duplications are
predicted to have occurred since the divergence of the SAR11 lineages from a last common ancestor. Paralogs are concentrated
in a few COG (Clusters of Orthologous Groups) functional categories: energy production and conversion (C), amino acid
transport and metabolism (E), carbohydrate transport and metabolism (G), cell wall/membrane/envelope biogenesis (M), and those
with mixed designations (Fig. 7B; see also Fig. S3 in the supplemental material). Amino acid transport and metabolism (E)
accounts for the largest number of paralogs in the seven SAR11 genomes, of which 38% are found only in HIMB59 (see Table S1
at the above URL).

Conserved gene content of the Pelagibacteraceae. Similar to the core genomes of Prochlorococcus, the Roseobacter clade,
and Shewanella, the SAR11 core genome possesses a high proportion of genes coding for proteins involved in housekeeping
functions and central metabolism, with a small fraction (2.1 to 3.7%) of core SAR11 genes not assigned to COG functional
categories (Fig. 8). The uncategorized fraction in the flexible genome is much higher (23.5 to 34.2%) due to a larger
proportion of putative and hypothetical genes. Compared to the core genome, the SAR11 flexible genome includes an
overrepresentation of genes assigned to the COG categories amino acid transport and metabolism (E), carbohydrate transport and
metabolism (G), inorganic ion transport and metabolism (P), and general (R) and unknown (S) functions (Fig. 8A).

Generally, SAR11 cells are predicted to share a typical electron transport chain and a complete tricarboxylic acid cycle.
In addition, all of the SAR11 genomes encode putative genes for the biosynthesis of most of the 20 standard amino acids and
some but not all vitamins and cofactors that are predicted to be required (see supplemental text at
http://giovannonilab.science.oregonstate.edu/publications). All strains lack a phosphoenolpyruvate:sugar phosphotransferase
transport system (PTS) but have a complete non-oxidative portion and an incomplete oxidative portion of the pentose phosphate
shunt.

All SAR11 genomes encode proteorhodopsin (PR), which is found in two groups that are specific for different light
wavelengths: green and blue absorbing (GPR and BPR, respectively) (44, 45). Whereas GPRs have been found to be highly abundant
in the North Atlantic and surface waters of the Mediterranean Sea, BPRs dominate the open ocean, such as in the Sargasso Sea
(14, 46). The two SAR11 strains isolated from open ocean sites, HTCC7211 (Sargasso Sea) and HTCC9565 (northeastern Pacific),
contain BPR, while the remaining coastal strains encode GPRs. Strain HIMB114 encodes an additional divergent PR gene with a
currently unknown function and absorption spectrum. Putative genes for the biosynthesis of retinal from β-carotene, crtIBY and
blh (15), are present in all seven SAR11 genomes. Thus, the conservation of PR and associated genes in all of the genomes
examined in this study demonstrates an important adaptive role for this gene across the SAR11 clade, where it potentially
facilitates survival by providing ATP during periods of carbon limitation, as demonstrated for HTCC1062 (16).

In SAR11, a high proportion (13 to 16%) of all protein-coding genes encode transport proteins, in comparison to all open
reading frames (ORFs) in the Bacteria (~9%) and the Roseobacter clade (6%) (47, 48). Within this fraction, the most abundant
transporters are primary active transporters, including ABC (ATP-binding cassette) and electrochemical potential-driven
transporters (see Table S7 at http://giovannonilab.science.oregonstate.edu/publications). The core set of transporters found
in all SAR11 genomes includes ABC transport systems for general L-amino acids (encoded by yhdWXYZ), iron(III) (sfuABC),
lipoprotein release (ycfUV), multidrug/antibiotic (yadGH), and heme export (ccmD). Electrochemical potential-driven
transporters for potassium (trkA) and mannitol/chloroaromatic compounds (tripartite ATP-independent periplasmic [TRAP] type),
a channel transport system for ammonium (amtB), and several incompletely characterized transport systems are also present in
all seven SAR11 genomes.

Strain variation within the pan-genome of the Pelagibacteraceae. In spite of the highly conserved gene content in SAR11,
we observed variability in some notable genes and pathways that have been considered enigmatic for the type strain, HTCC1062,
and which may serve as lineage-specific adaptations. Glucose oxidation by a proposed variant of the Entner-Doudoroff (ED)
pathway (49) is predicted only for HIMB5, HTCC1002, and HTCC1062. HIMB59 is the only genome predicted to have a complete
Embden-Meyerhof-Parnas (EMP) glycolysis pathway and genes for metabolism of other sugars (see supplemental text at
http://giovannonilab.science.oregonstate.edu/publications). Furthermore, an expansion of transporter paralogs in COG category
G indicates that this microorganism may be adapted to use a variety of sugar compounds. Recent work has demonstrated that
subclade Va organisms bloom at the surface in the Sargasso Sea during the same time periods as subclade Ia organisms there
(Vergin et al., submitted), and thus carbohydrate utilization may allow HIMB59-type strains to co-occur with the numerically
dominant SAR11 subclade Ia. However, HIMB59 does not have genes for the glyoxylate bypass, which is a conserved feature of
subclade Ia genomes. SAR11 genes for metabolism of one-carbon and methylated compounds (17) are conserved in subclade Ia
organisms, although they have variable distribution in the other two strains (see Fig. S4 in the supplemental material; see
also supplemental text at http://giovannonilab.science.oregonstate.edu/publications). HIMB5 and HIMB114 contain putative
copies of aerobic carbon monoxide dehydrogenase (CODH) genes (coxSLM) but, as with other SAR11 strains, no genes for carbon
fixation. HIMB114 contains a complete serACB operon, which implies that this strain may be able to synthesize glycine de novo,
in contrast with HTCC1062 (19) and other SAR11 subclade Ia organisms (see supplemental text at the above URL). Sulfur and
phosphate metabolism are also not conserved among SAR11 strains. Genes for dimethylsulfoniopropionate (DMSP) transport and
demethylation are missing from HIMB114, whereas HTCC9565 is the only strain that contains a predicted copy of sulfate
adenylyltransferase (encoded by sat), which catalyzes the first step of sulfate reduction. Five of the seven strains contain
the predicted high-affinity phosphate operon (pstSCAB-phoBU), but only the HTCC7211 genome encodes genes for production and
use of polyphosphate (ppx and ppk) as well as phosphonate transport (phnCDEE2) and degradation (phnGHIJKLMN and phnZX). The
Pacific isolates HIMB114 and HIMB59 also possess genes for phosphonate transport, but phosphonate mineralization is likely to
be restricted to specific compounds (see supplemental text at the above URL).

DISCUSSION 

The observation of small genomes in SARI 1 strains of the la sub- 
clade provided strong support for streamlining theory and 
showed that streamlined heterotrophs could be highly successful 
interacting with complex organic carbon originating from phyto- 
plankton communities (5, 12). New results reported here extend 
this observation by showing that small genomes, high synteny, 
and conservation of the core genome are consistent qualities of 
isolates spanning the SARI 1 clade. It was previously hypothesized 
that paralogs of HTCC1062 were of ancient origin (5). However, 
our current phylogenetic designation of the majority of duplications
as inparalogs indicates that most gene duplication has happened
since the divergence of the SAR11 lineage. The overall paucity
of paralogs in SAR11 genomes compared to those of other
bacteria, especially those with free-living lifestyles (50), provides 
further support for the hypothesis that a streamlined genome was 
a feature of the last common ancestor of SAR11.

At ~600 genes, the core genome for SAR11 provides an estimate
of the lower limit of genes essential for maintenance of the
free-living state in marine environments. Genome reduction in 
Prochlorococcus often includes the loss of gene families that are 
environmentally important but not essential in all water column 
environments, for example, genes for the uptake of macronutri- 
ents, such as compounds of P and N (32, 51). Coleman et al. (13) 
noted that the pattern of gene gain and loss in Prochlorococcus for 
genes involved in macronutrient (P and N) acquisition is often not 
congruent with phylogeny, suggesting that these genes play a role 
in the evolution of ecotypes (52). 

Similarly, we observed differential conservation of genes for 
iron and phosphorus metabolism in SAR11. As famously de- 
scribed in the "Iron Hypothesis," in some ocean surface waters, Fe 
is so scarce that it limits primary production over broad ocean regions
(53). Smith et al. (54) described complex regulatory adaptations
to iron limitation in strain HTCC1062. Transcripts for most
genes involved in iron metabolism increased in iron-limited cells, 
but only the iron-binding protein encoded by sfuC, a component 
of the predicted sfuABC transporter, increased in both mRNA and 
protein abundance during iron limitation. Two RNA-binding 
proteins (CspE and CspL), members of the cold-shock protein 
family, were postulated to play a role in a broad regulatory re- 
sponse that suppressed translation of nonessential transcripts. 
cspL, the ABC transporter genes (sfuABC), and the Fe-S synthesis 
operon (sufBCD) were all found in the SAR11 core genome, but 
the transcription factors encoded by fur and irr, which are re- 
ported to be involved in iron regulation in bacteria (55, 56), are 
conserved only in the SAR11 subclade Ia core genome. Iron is
essential for respiration, and therefore the conservation of sfuABC 
and sufBCD is not surprising. The absence of iron-related regulatory
genes from the SAR11 core genome suggests that iron metabolism
is constitutive in some strains and that the iron regulatory
system does not consistently yield the fitness benefits across the
clade that would ensure that these genes are maintained by selection.

Phosphorus is probably the most common cause of nutrient
limitation in the oceans (57). It is biologically accessible in a vari- 
ety of forms, most importantly as the phosphate ion but also as 
phosphonates, in which P and C atoms are linked directly by a 
bond. However, P availability relative to that of other nutrients 
differs across oceans. For example, the subtropical Atlantic Ocean 
is typically regarded as phosphate limited, whereas phosphate is 
thought to be more available in the central Pacific (e.g., see refer- 
ence 58). Thus, in contrast to iron, phosphate limitation is prob- 
ably a much less universal selective pressure, since no phosphate 
metabolism genes are conserved in all strains: HTCC1002 and 
HTCC9565 lack the high-affinity phosphate operon, and only 
HTCC7211 contains genes for both polyphosphate and phosphonate
transport and degradation (see supplemental text at
http://giovannonilab.science.oregonstate.edu/publications). Recently,
Coleman and Chisholm (58) found that the abundance of 
phosphate-related SAR11 gene content in metagenomic data sets
was higher for the phosphate-depleted waters of the Atlantic than 
for the Pacific. Similarly, genes for the metabolism of polyphos- 
phate have also been found to be more abundant in data sets 
collected from environments depleted in phosphate (59). Our 
data support these findings, since HTCC7211 is also the only isolate
from the Atlantic Ocean, and demonstrate how the open
SAR11 pan-genome provides for ecosystem-specific adaptations.

The values we report for core genome conservation and gene 
conservation as a function of 16S rRNA gene sequence similarity 
place the members of SAR11 as outliers among bacteria: the
seven strains investigated shared ~50% of total gene content
across 18% divergence in 16S rRNA gene sequence (~44% AAI),
while the average for bacteria is ~20% shared genes at this level of
divergence (25, 40). Early investigations of shared gene content by 
Konstantinidis and Tiedje (26) and Tamames (40) were extended 
later by Zaneveld et al. (60), who showed that conserved gene 
content tended to be higher among organisms from the same hab- 
itat. With the exception of the work of Zaneveld et al. (60), the 
studies referenced above did not examine the influence of com- 
mon ancestry on core genome conservation. To address this issue, 
in Fig. 4 we present an analysis of core genome content for a 
selection of microbial clades. This analysis shows that for many 
monophyletic groups, shared gene content is much higher than 
the averages reported by Konstantinidis and Tiedje (25) and that 
some, notably the Anaplasmataceae and the Thermotogales, are 
close to SAR11. 

Although unusual for comparative genomics studies, our findings
are consistent with previous conclusions drawn from analysis of
gene-to-gene boundaries and the conservation of synteny in
metagenomic data (12). By comparing the genomes of two closely
related SAR11 strains with metagenome data, Wilhelm et al. (12)
concluded that selection was variable across the SAR11 genome,
leading to high apparent diversity in SAR11 populations by common
metrics, while simultaneously maintaining a conservation of
gene content and function. Comparatively high synteny across 
our genomes in spite of "typical" amino acid divergence with de- 
creasing 16S rRNA identity agrees with these conclusions, since 
gene order conservation implies likely gene function conservation 
(38). 

We also report the conservation of a hypervariable region 
(HVR2) across the clade. The presence of this variable genome
region, bounded by structural RNA genes, is evidence that mechanistic
restrictions to horizontal gene flow cannot be invoked to
explain small genome size in SAR11 strains, providing further 
support for the conclusion that streamlining is a consequence of 
selective processes favoring genome minimization. Considering 
the enormous predicted population size of the SAR11 lineage 
globally (1), the open SAR11 pan-genome is apparently very large 
and is a significant genetic reservoir that can be exploited for adap- 
tive purposes. High rates of recombination between similar 
SAR11 cells, calculated to be as much as 60 times the mutation rate
(61, 62), may provide one means of accessing such a reservoir. 
Genes found in HVR2 appear to augment function, analogously 
to genomic islands in genome-streamlined Prochlorococcus strains 
(13), in some cases restoring basic metabolic pathways that are not
conserved across the clade. For example, the sat gene in 
HTCC9565, in combination with aprBA and/or APS kinase, may 
allow it to utilize sulfate as a sulfur source. The phosphofructokinase
in HIMB59, which gives this organism a complete glycolysis
pathway, is also located in HVR2 (see Table S1 at
http://giovannonilab.science.oregonstate.edu/publications).

The monophyly of the SAR11 clade has recently been called 
into question (63). However, in addition to the phylogenetic sup- 
port for inclusion of HIMB59 as a member of the SAR11 clade 
both with concatenated protein and 16S rRNA gene trees 
(Fig. 1) (6, 64; Vergin et al., submitted), the conservation of gene
content, synteny, and the HVR2 region across these strains pro- 
vides additional evidence for the shared common ancestry of 
HIMB59 and other SAR11 strains. In fact, such unusually high 
conservation in these metrics across the clade raises the question 
of whether or not SAR11 genomes may be evolving at an unusual
rate compared to those of other organisms. The depth of branching
for SAR11 in the 16S rRNA gene tree is comparable to that of
the nearby Rickettsiales clade (Fig. 1). AAI versus 16S rRNA gene 
identity follows predictions from previous observations (Fig. 5A) 
(25), indicating that the 16S rRNA gene is not evolving indepen- 
dently from the rest of the genome. Furthermore, while synteny in 
SAR11 genomes is higher than that in most other organisms, it is
not unprecedented (Fig. 5B), falling near that in other organisms 
with small genomes. Thus, while unusual, all of these features are 
consistent with genome streamlining, which is expected to mini- 
mize genomes to a highly constrained set of genes that offer max- 
imum fitness. Since these organisms form a monophyletic group 
with a depth of branching comparable to that of the Rickettsiales 
and have minimum AAI and 16S rRNA gene identities of ~44%
and 82%, respectively, we propose that the Pelagibacteraceae
be expanded to a novel order, the "Pelagibacterales." Based
on the same metrics, subclade Ia organisms should be considered
part of the genus "Candidatus Pelagibacter." 

Small genome size with a bias towards low GC content has also 
been observed in host-associated bacteria (65-67), as well as other 
free-living marine bacterioplankton, such as Prochlorococcus (3, 
68) and OM43 (69, 70). Whereas genetic drift coupled with re- 
laxed selection has been proposed as the driving force behind ge- 
nome reduction in host-dependent bacteria, selection for a more 
economical lifestyle is the purported pressure for genome reduction
in large populations of cells, such as those seen with Prochlorococcus
and SAR11 (5, 71, 72). In principle, small cells with small
genomes require fewer resources, such as carbon, nitrogen, and
phosphorus, to divide and also compete more efficiently for nutrients
because of their higher surface area-to-volume ratio (68,
72, 73).
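
The surface area-to-volume argument can be made concrete with a minimal sketch. It assumes an idealized spherical cell, which real cells are not, and the radii used are arbitrary illustrative values rather than measurements from this study. For a sphere the ratio reduces to 3/r, so halving the radius doubles the ratio.

```python
from math import pi

def surface_to_volume(radius_um):
    """Surface area-to-volume ratio (per µm) of a sphere with the given radius."""
    area = 4 * pi * radius_um ** 2
    volume = (4 / 3) * pi * radius_um ** 3
    return area / volume  # algebraically equal to 3 / radius_um

# Illustrative radii only (assumed values, not data from the paper).
small = surface_to_volume(0.25)
large = surface_to_volume(0.5)
```

Under this assumption, the smaller of the two hypothetical cells exposes twice as much membrane per unit of volume as the larger one.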

The paradox of genome streamlining is that small genomes are 
found in some planktonic marine bacteria but not others. Many 
common marine microbial lineages, such as the Roseobacter clade, 
Vibrio species, Photobacterium species, Pseudoalteromonas species,
and Alteromonas species, have genomes of average size (47, 74-76).
Plausible explanations for this paradox include differences in
life cycle strategy and differences in N_eμ, the product of effective
population size and mutation rate (77). Commonly, concepts 
such as generalist versus specialist (78), r strategist versus K
strategist (79), and oligotroph versus copiotroph (76) are used to
explain variation in life cycle strategy, with large genomes often
attributed to "generalists" (78). However, individual bacterial life
cycle strategies may be complicated and elude accurate descrip- 
tion with these concepts; for example, many Vibrio spp. alternate 
between specific host associations and living freely suspended in 
the water column (80). Moreover, small genomes suggest special- 
ization, but the members of SAR11, which have small genomes, 
cannot plausibly be characterized as specialists, being one of the 
most successful and widely distributed chemoheterotrophic 
groups in the ocean (1). 

Novel cultivation approaches that favor oligotrophs, such as
those we pioneered, have yielded some of the most dramatic
examples of genome reduction in free-living cells (5, 20, 69).
Following up on these observations, we reported unusual nutrient
requirements in these strains and linked these requirements to 
genome reduction (16-19). We hypothesize that genome stream- 
lining may explain why many microorganisms that are abundant 
in nature are difficult to cultivate. This is a testable hypothesis. It 
predicts that data emerging from single-cell genomics will
show that small genomes are prevalent among uncultured taxa 
and that unusual nutrient requirements stemming from genome 
reduction explain the difficulty of their cultivation. 

MATERIALS AND METHODS 

Isolation of SAR11 strains. Strains HTCC1062 and HTCC1002 were isolated
from the coastal Pacific Ocean, Newport, OR (4), strain HTCC9565 was
isolated from water collected above the Juan de Fuca Ridge, strain
HTCC7211 was isolated from the Bermuda Atlantic Time Series study site
located in the Sargasso Sea (21), and strains HIMB114, HIMB5, and
HIMB59 were isolated from tropical Kaneohe Bay, located on the
northeastern shore of the island of O'ahu, HI (see Table S8 at
http://giovannonilab.science.oregonstate.edu/publications). All strains were
isolated using dilution-to-extinction methods (4, 20, 81). Following
isolation, strains were grown in 100 liters of pristine seawater medium
amended with low concentrations of inorganic nitrogen and phosphorus
(1.0 µM NH4Cl, 1.0 µM NaNO3, and 0.1 µM KH2PO4) or nitrogen,
phosphorus, organic carbon, and iron (18). Cells were collected on
0.1-µm-pore-size polyethersulfone membrane filters, and genomic DNA was
isolated using a standard phenol-chloroform-isoamyl alcohol extraction 
protocol. 

Sequencing and annotation. Sequencing of the complete genomes of 
strains HTCC1062, HTCC1002, and HTCC7211 has been described
previously (5, 49). The genomes of strains HIMB5, HIMB59, and HIMB114
were sequenced by the J. Craig Venter Institute as part of the Moore
Foundation Microbial Genome Sequencing project
(http://camera.calit2.net/microgenome/). Strain HTCC9565 was sequenced
by the JGI as part of the Community Sequencing Program. MIGS
environmental metadata and sequencing details can be found elsewhere
(see Table S8 at http://giovannonilab.science.oregonstate.edu/publications).
Functional annotation was performed with the Integrated Microbial
Genomes Expert Review (IMG-ER) pipeline (82), except for the previously
annotated strains HTCC1062 and HTCC1002 (see references 5 and 12,
respectively), for which annotations were maintained (for details, see
supplemental Methods at http://giovannonilab.science.oregonstate.edu/publications).

Genome comparisons. We assessed homologous genes through the 
use of Hal, an automated pipeline for phylogenomic analysis (83). Hal 
initially performs an all-versus-all BLASTp analysis with all genome
protein FASTA files, followed by Markov clustering (MCL) at 13 inflation
parameters (I). We chose to use clusters generated at I = 1.5 because this is the
default setting for OrthoMCL and was shown to have the highest accuracy 
for detecting orthologs with that software program (84). From there, we 
curated the clusters for accuracy with several filtering steps that flagged 
potentially erroneous assignments of orthologs. Details of the filtering
steps are provided elsewhere (see supplemental Methods at
http://giovannonilab.science.oregonstate.edu/publications), along with
the determination of paralogs. In- and outparalogs were assessed according to the
method of Sonnhammer and Koonin (43) with phylogenetic trees (see 
supplemental Methods). Synteny was determined using scripts and meth- 
ods from Yelton et al. (38). 

Core genome and pan-genome analyses. To calculate the core and 
pan-genomes, as well as the unique genes per strain, we made use of the 
heat map table created by Hal (see Table S1 at
http://giovannonilab.science.oregonstate.edu/publications). By curating
this table with the alterations from the filters above, we were then
able to utilize a custom program, query, written for parsing data
output from Hal (83), to calculate the shared/unique gene content for
any combination of strains. Gene-gene boundary calculations were
completed with the data in Table S1 (see
the above URL) using a custom Perl script, available on request. 
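
The shared/unique gene calculation described in this paragraph can be illustrated with a minimal Python sketch. This is not the authors' query program or the Hal output format: the toy presence/absence table, the cluster identifiers, and the function name gene_content are all hypothetical. The sketch assumes each ortholog cluster has been reduced to the set of strains in which it occurs.

```python
# Hypothetical presence/absence data: ortholog cluster -> strains containing it.
# Cluster IDs and memberships are invented for illustration only.
clusters = {
    "c1": {"HTCC1062", "HTCC7211", "HIMB59"},
    "c2": {"HTCC1062", "HTCC7211"},
    "c3": {"HTCC7211"},
    "c4": {"HIMB59"},
    "c5": {"HTCC1062", "HIMB59"},
}

def gene_content(clusters, strains):
    """Return (core, pan, unique) cluster sets for the chosen strains.

    core   : clusters present in every chosen strain
    pan    : clusters present in at least one chosen strain
    unique : per strain, clusters found in that strain only (among those chosen)
    """
    strains = set(strains)
    core, pan = set(), set()
    unique = {s: set() for s in strains}
    for cluster_id, members in clusters.items():
        present = members & strains
        if not present:
            continue  # cluster absent from all chosen strains
        pan.add(cluster_id)
        if present == strains:
            core.add(cluster_id)
        if len(present) == 1:
            unique[next(iter(present))].add(cluster_id)
    return core, pan, unique
```

Calling gene_content(clusters, ["HTCC1062", "HTCC7211"]) restricts the comparison to two strains, which mirrors the idea of computing shared and unique content for any combination of strains.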

The sequential inclusion of seven genomes allows 7!/(N!(7 - N)!)
possible combinations of N genomes with which to calculate the core
genome, new orthologs per genome, and the pan-genome. The regression
analysis of the SAR11 core
genome, new orthologs, and pan-genome was performed as described by 
Tettelin et al. (23, 24). For details, see supplemental Methods at
http://giovannonilab.science.oregonstate.edu/publications.
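
As a rough illustration of the combinatorics, the sketch below counts the combinations of N genomes drawn from seven and averages core and pan-genome sizes over every combination of a given size. The genome names and cluster sets are toy data, the function name core_pan_by_n is hypothetical, and the regression of Tettelin et al. is not reproduced here.

```python
from itertools import combinations
from math import comb
from statistics import mean

# Number of N-genome combinations drawn from seven genomes: 7!/(N!(7 - N)!)
counts = {n: comb(7, n) for n in range(1, 8)}

# Hypothetical gene-cluster content per genome (toy data, not from the paper).
genomes = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 6},
}

def core_pan_by_n(genomes):
    """Mean core and pan-genome sizes over all combinations of N genomes."""
    result = {}
    names = sorted(genomes)
    for n in range(1, len(names) + 1):
        cores, pans = [], []
        for combo in combinations(names, n):
            sets = [genomes[g] for g in combo]
            cores.append(len(set.intersection(*sets)))  # shared by all
            pans.append(len(set.union(*sets)))          # present in any
        result[n] = (mean(cores), mean(pans))
    return result
```

For seven genomes the counts sum to 127 combinations in total, which is small enough to enumerate exhaustively rather than sample.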

16S phylogeny. 16S rRNA gene sequences from SAR11 organisms 
with sequenced genomes and clone libraries were aligned with near neigh- 
bors identified by previous phylogenetic and phylogenomic tests of the 
Alphaproteobacteria (6, 85, 86). Sequences were aligned with the software 
program NAST (87) and Lane masked at Greengenes (http://greengenes.lbl.gov/),
and the phylogeny was determined using the RAxML software
program (88) (-f a -m GTRGAMMA -N 500). Accession numbers are 
provided in the supplemental Methods at
http://giovannonilab.science.oregonstate.edu/publications.

SUPPLEMENTAL MATERIAL 

Supplemental material for this article may be found at
http://mbio.asm.org/lookup/suppl/doi:10.1128/mBio.00252-12/-/DCSupplemental.

Figure S1, PDF file, 0.3 MB.

Figure S2, PDF file, 1.3 MB. 

Figure S3, PDF file, 0.1 MB. 

Figure S4, PDF file, 0.2 MB.